# Metric Card for MAUVE

## Metric description

MAUVE is a library built on PyTorch and HuggingFace Transformers that measures the gap between neural text and human text using the eponymous MAUVE measure. The measure summarizes both Type I and Type II errors, measured softly using [Kullback–Leibler (KL) divergences](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence).

This metric is a wrapper around the [official implementation](https://github.com/krishnap25/mauve) of MAUVE.

For more details, consult the [MAUVE paper](https://arxiv.org/abs/2102.01454).
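
To make the description above more concrete, here is a minimal pure-Python sketch of how a score can be read off two histograms as the area under a divergence curve built from KL divergences to mixtures of the two distributions. It is a simplified illustration, not the official implementation: the function names `kl_divergence`, `divergence_curve` and `area_under_curve` are hypothetical, the mixture weights are evenly spaced, and the two end points on the axes are added by assumption.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    total = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return max(total, 0.0)  # clamp tiny negative values caused by rounding


def divergence_curve(p_hist, q_hist, num_points=25, scaling_factor=5.0):
    """Trace (exp(-c*KL(Q||R)), exp(-c*KL(P||R))) over mixtures R = w*P + (1-w)*Q."""
    points = [(0.0, 1.0)]  # assumed end point on the vertical axis
    for k in range(num_points, 0, -1):  # w runs from near 1 down to near 0
        w = k / (num_points + 1)
        r = [w * pi + (1 - w) * qi for pi, qi in zip(p_hist, q_hist)]
        x = math.exp(-scaling_factor * kl_divergence(q_hist, r))
        y = math.exp(-scaling_factor * kl_divergence(p_hist, r))
        points.append((x, y))
    points.append((1.0, 0.0))  # assumed end point on the horizontal axis
    return points


def area_under_curve(points):
    """Trapezoidal rule over points ordered by increasing x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Identical histograms versus histograms with disjoint support.
same = area_under_curve(divergence_curve([0.5, 0.5], [0.5, 0.5]))
disjoint = area_under_curve(divergence_curve([1.0, 0.0], [0.0, 1.0]))
print(same, disjoint)
```

In this sketch, identical histograms give an area close to 1 and histograms with disjoint support give an area close to 0, which matches the direction of the MAUVE score described in this card.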


## How to use 

The metric takes two lists of strings, each string being a sequence of tokens separated by spaces: `predictions` (the text generated by the model) and `references` (a reference text for each prediction):

```python
from datasets import load_metric

# Load the MAUVE metric and score the predictions against the references.
mauve = load_metric('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello world", "goodnight moon"]
mauve_results = mauve.compute(predictions=predictions, references=references)
```

It also has several optional arguments:

`num_buckets`: the size of the histogram used to quantize the two text distributions, P and Q. Options: `auto` (default) or an integer.

`pca_max_data`: the number of data points to use for PCA dimensionality reduction prior to clustering. If `-1`, use all the data. The default is `-1`.

`kmeans_explained_var`: amount of variance of the data to keep in dimensionality reduction by PCA. The default is `0.9`.

`kmeans_num_redo`: number of times to redo k-means clustering (the best objective is kept). The default is `5`.

`kmeans_max_iter`: maximum number of k-means iterations. The default is `500`.

`featurize_model_name`: name of the model from which features are obtained, from one of the following: `gpt2`, `gpt2-medium`, `gpt2-large`, `gpt2-xl`. The default is `gpt2-large`.

`device_id`: device for featurization. Supply a GPU id (e.g. `0` or `3`) to use a GPU. If no GPU with this id is found, the metric uses the CPU.

`max_text_length`: maximum number of tokens to consider. The default is `1024`.

`divergence_curve_discretization_size`: number of points to consider on the divergence curve. The default is `25`.

`mauve_scaling_factor`: Hyperparameter for scaling. The default is `5`.

`verbose`: If `True` (default), running the metric will print running time updates.

`seed`: random seed to initialize k-means cluster assignments, randomly assigned by default.
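
As a sketch of how the optional arguments listed above might be passed, the snippet below collects a few of them in a dictionary and forwards them as keyword arguments to `compute`. The helper name `compute_mauve_with_options` and the chosen values are hypothetical and only illustrative; they are not recommended settings.

```python
# Optional arguments documented above; the values are illustrative, not recommendations.
MAUVE_OPTIONS = {
    "featurize_model_name": "gpt2",  # smallest of the listed GPT-2 variants
    "device_id": 0,                  # per the list above, CPU is used if GPU 0 is not found
    "max_text_length": 256,
    "num_buckets": "auto",
    "verbose": False,                # silence running time updates
    "seed": 25,                      # fixed seed for repeatable k-means initialization
}


def compute_mauve_with_options(predictions, references, options=MAUVE_OPTIONS):
    """Hypothetical helper: forward the options as keyword arguments to compute()."""
    # Imported lazily so this module can be loaded without `datasets` installed.
    from datasets import load_metric

    mauve = load_metric("mauve")
    return mauve.compute(predictions=predictions, references=references, **options)
```

Calling `compute_mauve_with_options(predictions, references)` with two lists of strings runs the metric with these settings; the featurization model is downloaded on first use.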
    


## Output values

This metric returns 5 values, which the examples below read as attributes (e.g. `mauve_results.mauve`):

`mauve`: MAUVE score, which ranges between 0 and 1. **Larger** values indicate that P and Q are closer.

`frontier_integral`: Frontier Integral, which ranges between 0 and 1. **Smaller** values indicate that P and Q are closer.

`divergence_curve`: a numpy.ndarray of shape (m, 2); plot it with `matplotlib` to view the divergence curve.

`p_hist`: a discrete distribution, which is a quantized version of the text distribution `p_text`.
 
`q_hist`: same as above, but with `q_text`.
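
The `divergence_curve` output can be plotted along the following lines. This is a minimal sketch: the function name `plot_divergence_curve` and the output path are hypothetical, and plotting column 1 against column 0 is an assumption about which coordinate goes on which axis. The function accepts any sequence of two-element rows, so the array returned by the metric or a plain list of pairs both work.

```python
def plot_divergence_curve(divergence_curve, output_path="divergence_curve.png"):
    """Save a line plot of an (m, 2) divergence curve and return the file path."""
    # Imported lazily; the non-interactive Agg backend avoids needing a display.
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    # Assumption: column 1 goes on the horizontal axis, column 0 on the vertical axis.
    xs = [float(point[1]) for point in divergence_curve]
    ys = [float(point[0]) for point in divergence_curve]

    fig, ax = plt.subplots()
    ax.plot(xs, ys, marker="o")
    ax.set_xlabel("divergence_curve[:, 1]")
    ax.set_ylabel("divergence_curve[:, 0]")
    ax.set_title("MAUVE divergence curve")
    fig.savefig(output_path)
    plt.close(fig)
    return output_path
```

With a result from `mauve.compute`, this would be called as `plot_divergence_curve(mauve_results.divergence_curve)`.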


### Values from popular papers

The [original MAUVE paper](https://arxiv.org/abs/2102.01454) reported values ranging from 0.88 to 0.94 for open-ended text generation using a text completion task in the web text domain. The authors found that bigger models resulted in higher MAUVE scores, and that MAUVE is correlated with human judgments.


## Examples 

Perfect match between prediction and reference:

```python
from datasets import load_metric

# Identical predictions and references.
mauve = load_metric('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello world", "goodnight moon"]
mauve_results = mauve.compute(predictions=predictions, references=references)
print(mauve_results.mauve)
# 1.0
```

Partial match between prediction and reference:

```python
from datasets import load_metric

# Predictions and references that only partly overlap.
mauve = load_metric('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello there", "general kenobi"]
mauve_results = mauve.compute(predictions=predictions, references=references)
print(mauve_results.mauve)
# 0.27811372536724027
```

## Limitations and bias

The [original MAUVE paper](https://arxiv.org/abs/2102.01454) did not analyze the inductive biases present in different embedding models, but related work has shown different kinds of biases exist in many popular generative language models including GPT-2 (see [Kirk et al., 2021](https://arxiv.org/pdf/2102.04130.pdf), [Abid et al., 2021](https://arxiv.org/abs/2101.05783)). The extent to which these biases can impact the MAUVE score has not been quantified.

Also, calculating the MAUVE metric involves downloading the model from which features are obtained. The default model, `gpt2-large`, takes over 3GB of storage space, and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller model; for instance, `gpt2` is 523MB.


## Citation

```bibtex
@inproceedings{pillutla-etal:mauve:neurips2021,
  title={MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers},
  author={Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid},
  booktitle = {NeurIPS},
  year      = {2021}
}
```

## Further References 
- [Official MAUVE implementation](https://github.com/krishnap25/mauve)
- [Hugging Face Tasks - Text Generation](https://huggingface.co/tasks/text-generation)
